Explainable semi-supervised deep learning shows that dementia is associated with small, avocado-shaped clocks with irregularly placed hands

The clock drawing test is a simple and inexpensive method to screen for cognitive frailties, including dementia. In this study, we used the relevance factor variational autoencoder (RF-VAE), a deep generative neural network, to represent digitized clock drawings from multiple institutions using an optimal number of disentangled latent factors. The model identified unique constructional features of clock drawings in a completely unsupervised manner. Domain experts judged these factors to be novel and not extensively examined in prior research. The features were informative, as they distinguished dementia from non-dementia patients with an area under the receiver operating characteristic curve (AUC) of 0.86 alone, and 0.96 when combined with participants’ demographics. The correlation network of the features depicted the “typical dementia clock” as having a small size, a non-circular or “avocado-like” shape, and incorrectly placed hands. In summary, we report an RF-VAE network whose latent space encoded novel constructional features of clocks that classify dementia from non-dementia patients with high performance.

Figure 1. Example clock drawing tests with Libon scoring. (A) Clock drawing to command and copy conditions by the first individual. Clocks were scored on various metrics by the Libon criteria. This scoring system adjudicated the command clock to have atypical shape, hand placement and number spacing. It also adjudicated the copy clock to have erroneous number spacing. (B) Clock drawing to command and copy conditions by the second individual. The Libon criteria decided that the command clock has atypical size, hand placement, and number spacing, and is drawn in a counter-clockwise direction. The copy clock was evaluated to have anomalous shape, hand placement, and number spacing, and to have digit repetitions (perseveration).

Results
Participants. This is a multi-center, multi-cohort study performed in collaboration between the University of Florida and Rowan University, New Jersey. Three cohorts were used in this study, namely a training cohort, a fine-tuning cohort and a testing cohort. Table 1 shows the participants' demographics in the training and classification (fine-tuning and testing) cohorts. All participants in the classification cohort completed both command and copy condition drawings. Three individuals in the training cohort could not complete the command condition. In the classification cohort, dementia participants were significantly older, had lower MMSE scores, and had fewer years of education than their non-dementia peers. The training cohort had an equal percentage of male and female participants, whereas the classification cohort was predominantly male. Furthermore, there were significantly more male individuals in the dementia cohort. Both the training and classification cohorts had a predominance of white people. Figure 2A shows the RF-VAE trained latent space after completion of unsupervised training with 23,521 clock drawings from both command and copy conditions. Each column corresponds to one latent dimension, and represents traversal over the latent space along that dimension. Due to disentanglement, there was no cross-correlation between these latent dimensions in the training dataset (Supplementary Fig. 1). Figure 2B defines the nature of each latent variable and elucidates its change over the corresponding latent dimension.

RF-VAE latent space (training dataset).
Column A shows a change in the brightness of the clock drawing. In reality, this corresponds to the size of the clock drawing, as clocks of various sizes were resized to 64 × 64 during preprocessing, resulting in a decrease in the brightness of the larger clocks. Column B shows the existence of ovate and obovate (avocado-shaped) clocks in the training dataset. The direction of orientation of the obovate clock reverses as this latent dimension increases. This increase is associated with a lengthening of the clock hands. Column C encodes the change of clock shape from prolate (elongated) to oblate (flattened) with an increase in its latent dimension. Column D shows an upward movement of the point of intersection of the clock hands from the geometric clock center, with an increase in its latent dimension. Column E shows the presence of eccentric ellipsoidal clock drawings. The direction of the eccentricity of ellipsoidal clocks changes from left to right as this latent dimension increases. Column F shows an increase in the angle between the clock hands as its latent dimension increases. Column G shows the existence of non-circular clocks in the dataset. An increase in this latent dimension changes the clock shape from square to circular to rhomboid. Column H again shows ellipsoidal clocks, but in this case, the orientation changes from right to left as the latent dimension increases. Therefore, this dimension is the logical opposite of the fifth latent dimension. Column I shows the presence of clocks that have a horizontal circular asymmetry (side bulge). The side bulge changes position from left to right as the latent dimension increases. Column J shows a rotation of the clock hands while maintaining a constant inter-hand angle. This indicates clocks where the subject set the hands to numbers other than 11 and 2, or a general shift in the placement of digits in the clock.
These are the ten disentangled constructive imperfections identified by the RF-VAE network from the training dataset. In the case of all factors, a shift towards higher absolute value of the latent variable is associated with the loss of digits on the clockface (Fig. 2A).
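The link between drawing size and brightness described for Column A can be illustrated with a small sketch. It assumes a fixed one-pixel pen width and area-averaging downsampling, neither of which is specified in the text; the helper names `draw_ring`, `block_average` and `mean_ink` are hypothetical. Here ink is coded as 1 and paper as 0, so "brightness" means mean ink intensity.

```python
import math

def draw_ring(n):
    """Return an n x n canvas (0 = paper, 1 = ink) with a one-pixel-wide circle outline."""
    c, r = (n - 1) / 2, 0.45 * n
    return [[1.0 if abs(math.hypot(x - c, y - c) - r) <= 0.5 else 0.0
             for x in range(n)] for y in range(n)]

def block_average(img, out=64):
    """Downsample a square image to out x out by averaging non-overlapping blocks.

    Assumes the side length is a multiple of `out`.
    """
    k = len(img) // out
    return [[sum(img[y * k + dy][x * k + dx]
                 for dy in range(k) for dx in range(k)) / (k * k)
             for x in range(out)] for y in range(out)]

def mean_ink(img):
    """Average ink intensity over all pixels of the image."""
    return sum(map(sum, img)) / (len(img) * len(img[0]))

# The same pen stroke on larger canvases leaves less ink per pixel after
# everything is brought down to 64 x 64.
ink_by_size = {n: mean_ink(block_average(draw_ring(n))) for n in (128, 256, 512)}
for n, ink in ink_by_size.items():
    print(n, round(ink, 4))
```

Because the outline length grows linearly with the canvas side while the area grows quadratically, the mean ink after resizing falls roughly in inverse proportion to the original size, which is consistent with a latent dimension that reads as brightness but tracks size.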

RF-VAE latent space (classification dataset).
All clocks in the classification dataset contained these anomalies to different degrees. Supplementary Figs. 2A-I show the distribution of each feature among dementia and non-dementia participants. Figure 3A shows the comparison between the means and standard deviations of each feature in the dementia and non-dementia groups after removing the confounding effects of age and education (Supplementary Table 1). Comparing Fig. 3A to Fig. 2A shows that dementia clocks are considerably smaller, obovate, oblate clocks with vertically displaced hands having a large angle between them. Rotation of the clock hand assembly showed the maximum drop in significance after compensating for age and education differences (Supplementary Table 1). Square-rhomboid and side-bulge have bimodal dementia distributions and unimodal non-dementia distributions (Supplementary Fig. 2G,I), although they are not significantly different between the dementia and non-dementia groups. Furthermore, we found the number of "atypical occurrences" of each feature in the dementia group by comparing them against the mean and standard deviation of the respective non-dementia distribution (Fig. 3B). An atypical occurrence was counted each time a latent variable value in a dementia clock was more than two standard deviations from the mean of the control distribution. Size has the highest number of atypical occurrences in the dementia cohort. Square-rhomboid and side-bulge have the lowest number of atypical occurrences in the dementia cohort. Size, obovateness, prolate-oblateness, vertical displacement of clock hand assembly, and rotation of clock hand assembly are most frequently atypical in dementia clocks. We examined the cross-correlation between different latent variables on the classification dataset and found the presence of positive and negative correlations (Fig. 4A).
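The "atypical occurrence" count can be sketched as follows. This is a minimal illustration under two assumptions not stated in the text: the deviation is treated as two-sided, and the sample standard deviation is used. The function name `atypical_count` and the toy values are hypothetical.

```python
from statistics import mean, stdev

def atypical_count(dementia_values, control_values, n_sd=2.0):
    """Count dementia values lying more than n_sd control SDs from the control mean."""
    mu = mean(control_values)
    sd = stdev(control_values)  # sample standard deviation (assumption)
    return sum(abs(v - mu) > n_sd * sd for v in dementia_values)

# Toy values for one latent variable; not data from the study.
control = [-0.4, -0.1, 0.0, 0.1, 0.4]
dementia = [-1.5, 0.2, 0.9, 0.05]
print(atypical_count(dementia, control))
```

Applying the same function to each of the ten latent variables in turn would give one count per feature, in the spirit of Fig. 3B.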
We used these correlations as adjacency values of a graph to represent the relations between the latent variables in a graphical format (Fig. 4B). The graph depicts the presence of three subnetworks characterized by relatively high intra-network positive correlation (correlation > 0.2) and inter-network negative correlations (correlation < −0.2). The three subnetworks comprise (a) obovate-eccentricity, (b) vertical displacement of clock hands-square/rhomboid, and (c) prolate/oblate-angle between clock hands. Prolate/oblate is negatively correlated with eccentricity and obovate. Vertical displacement of clock hands is negatively correlated with eccentricity. Furthermore, clock size and rotation angle of clock hand assembly show a weak positive correlation (correlation ~ 0.1). Clock size is negatively correlated with square/rhomboid. Clock hand rotation is negatively correlated with prolate/oblate. Finally, the dementia label is correlated with small clock size, avocado-shape, flattening of the clock face (oblateness), eccentricity, increasing angle between hands, and anticlockwise rotation of the hand assembly.
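Turning a correlation matrix into a graph of positive and negative links can be sketched as below. This is an illustration only: the feature names are borrowed from the text, but the values are made up, and the helpers `pearson` and `correlation_edges` are hypothetical. The ±0.2 cut-off follows the thresholds quoted above.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def correlation_edges(features, threshold=0.2):
    """Return (name_a, name_b, sign) for every pair whose |r| exceeds the threshold."""
    names = list(features)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) > threshold:
                edges.append((a, b, "positive" if r > 0 else "negative"))
    return edges

# Made-up latent values for five clocks; not data from the study.
toy = {
    "obovate":        [0.1, 0.4, 0.6, 0.9, 1.2],
    "eccentricity":   [0.2, 0.3, 0.7, 0.8, 1.1],
    "prolate_oblate": [1.0, 0.8, 0.5, 0.3, 0.1],
    "side_bulge":     [1.0, -1.0, 0.0, -1.0, 1.0],
}
edges = correlation_edges(toy)
for edge in edges:
    print(edge)
```

In this toy example `side_bulge` stays below the threshold against every other feature and therefore ends up as an isolated node, while the other three features form one positive and two negative links.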
Classification performance (fine-tuning and testing datasets). We simultaneously fine-tuned the weights of the RF-VAE encoder and trained a neural network classifier with the fine-tuning dataset. The ten latent variables generated by the RF-VAE encoder were input to the classifier, first as standalone features and second together with demographics (age, sex, race, and years of education), for distinguishing dementia from non-dementia. The test dataset was used to report the final performance metrics on both occasions, as shown in Table 2. The 95% confidence intervals show the robustness of the model's performance over bootstrapped versions of the test data. The model achieves good performance on the test data using only the ten latent variables and achieves almost perfect classification when demographics are added to the model. The classification performance using solely demographic information is presented for reference.
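A bootstrapped confidence interval for the AUC can be sketched as follows. This is a minimal stdlib illustration, not the authors' code: the scores are synthetic, the helper names `auc`, `percentile` and `bootstrap_auc` are hypothetical, and resamples that happen to contain a single class are redrawn (an assumption). The class counts of 28 and 87 and the 100 resamples mirror the test set and bootstrap settings reported in the Methods.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)  # ties count half
    return wins / (len(pos) * len(neg))

def percentile(sorted_vals, q):
    """Linear-interpolation percentile of pre-sorted values, q in [0, 100]."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def bootstrap_auc(labels, scores, n_boot=100, seed=0):
    """Return (2.5th percentile, median, 97.5th percentile) of the AUC over resamples."""
    rng = random.Random(seed)
    idx = range(len(labels))
    stats = []
    while len(stats) < n_boot:
        sample = [rng.choice(idx) for _ in idx]  # sampling with replacement
        ys = [labels[i] for i in sample]
        if 0 < sum(ys) < len(ys):  # AUC needs both classes present
            stats.append(auc(ys, [scores[i] for i in sample]))
    stats.sort()
    return percentile(stats, 2.5), percentile(stats, 50), percentile(stats, 97.5)

# Synthetic classifier scores: 28 positives and 87 negatives.
rng = random.Random(1)
labels = [1] * 28 + [0] * 87
scores = [rng.gauss(1.5 if y else 0.0, 1.0) for y in labels]
ci_low, median_auc, ci_high = bootstrap_auc(labels, scores)
print(round(ci_low, 3), round(median_auc, 3), round(ci_high, 3))
```

The same resampling loop can wrap any other metric (accuracy, sensitivity, and so on) by swapping out the `auc` call.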

Discussion
RF-VAE delineated ten constructional features in clocks drawn by participants as part of a routine medical assessment in a preoperative setting. The ten constructional factors are as follows: (1) size, (2) degree and orientation of obovate, (3) prolate-oblate, (4) vertical displacement of the point of intersection of clock hands, (5) degree and direction of ellipticity, (6) angle between clock hands, (7) square-rhomboid clockfaces, (8) degree and direction of ellipticity in an opposite sense than (5), (9) degree and direction of side-bulge of clockface, and (10) rotation of clock hands assembly. These factors are deemed independent generative factors that are significant sources of variation in clock drawings by the unsupervised training of an RF-VAE. Each clock comprised a superposition of these factors to different degrees. Statistical comparison of the different latent features between dementia and non-dementia showed that in our dataset dementia was most typically associated with small, avocado-shaped, oblate clocks with irregularly placed hands. Figure 5A shows a hypothetical clock drawing comprising a combination of the latent variables most highly associated with dementia in our dataset. Figure 5B shows the clock which was given the highest probability of being dementia by our neural network classifier.
These latent variables could distinguish dementia participants from non-dementia peers with high performance, and the addition of age, sex, race, and years of education resulted in near-perfect classification of dementia versus non-dementia in the test dataset. The model's high performance using standalone latent variables as features indicates that these features are highly informative of the participants' cognitive status. However, the significant improvement upon the addition of demographics indicates that demographics still contain non-redundant information useful for the classification of dementia from controls.
The factors discovered in this study are generally different from traditional analog metrics used to score a clock drawing test, such as digit placement accuracy, missing digits, hand placement accuracy and the ratio of hour hand to minute hand length. The RF-VAE latent variables generally describe a global change in the shape of the clockface and placement of clock hands, whereas dCDT features describe salient high resolution graphomotor and latency variables from the CDT. Despite broad differences, some similarities exist. The ratio between the lengths of major and minor axes in a clock drawing is reflected in the fifth and eighth latent dimensions (degree and direction of the eccentricity of the ellipsoid) of RF-VAE. Similarly, hand misplacement corresponds to latent dimensions four (vertical displacement of the point of meeting of clock hands), six (angle between clock hands), and ten (rotation of clock hand assembly from 11 and 2). Figure 5C,D show which factors are atypically expressed in the CDTs shown in Fig. 1. By comparing Fig. 1A,B with Fig. 5C,D we can appreciate that the RF-VAE factors represent the graphomotor elements of a clock drawing in a novel and more nuanced way than traditional scoring criteria.
Despite disentanglement being a requirement in discovering these features, some features are algorithmically associated. For example, a more oblate clock will have a greater angle between clock hands, and a change in the shape of the clock face from circular to square can vertically displace the clock hand assembly. These relations are reflected in the three subnetworks found from the classification dataset's cross-correlation patterns between variables. These data show that the statistical disentanglement achieved by RF-VAE does not necessarily translate to algorithmic independence between the features. Despite statistical disentanglement, the presence of algorithmic dependence between different constructional aspects of the clock drawing can result in correlations between variables in smaller datasets such as our classification cohort. Achieving algorithmic independence between generative features is a possible future course of research in this area. Finally, the weak positive correlation between size and clockwise rotation of the clock hand assembly defines the ideal clockface.
Some of these factors have been identified by domain experts as important in classifying different subtypes of dementia and other cognitive frailties. For instance, a smaller clockface area is associated with subcortical disease profiles with primary executive dysfunction (e.g., micrographia in Parkinson's disease) 9 , and misplacement of clock hands is associated with visual attention deficits and disinhibition 2 . In comparison to a previously published VAE encoding 27 , the RF-VAE encoding reported in this study achieved significantly better results on the same classification dataset using identical training methods. This improvement is due to diversification of the latent space, and disentangling the latent dimensions. Enlarging the latent space allowed us to encode more sources of variations, while disentangling them ensured minimal mutual information. This project advances bidirectional translational neuroscience with AI. Here, we have used the final result of dCDT to develop and validate an RF-VAE model for identifying dementia in a forward-translational experiment. Clinicians and domain experts can review the disentangled factors identified by the RF-VAE latent space in concert with their classification performance to understand novel feature combinations from the CDT and incorporate them in gold-standard cognitive assessments. This bidirectional opportunity allows domain experts to broaden their understanding of classic cognitive assessments while simultaneously driving the research in futuristic AI technologies with their invaluable domain expertise. This symbiotic association of domain expertise with progressive AI technologies is crucial for fields sensitive to domain-level concerns such as interpretability and mechanistic grounding.
Table 2. Performance of classifier on test data. AUC area under the curve, C.I confidence interval, NPV negative predictive value. *p value < 0.01. **p value < 0.001.
This study has certain limitations. Firstly, the classification performance improvement observed due to demographic features may be traced to the differences in average age and education level between dementia and non-dementia groups. However, this is in line with previous literature showing that higher age and lower education increase the risk of dementia in older adults [28][29][30] . Secondly, the preprocessing step involved resizing all clock images irrespective of their initial size to 64 × 64. This resulted in invariable obfuscation of key clock features such as the shape of digits and the presence of ticks and arrowheads, which can explain their absence from the trained RF-VAE latent space. Furthermore, although the RF-VAE has achieved statistical disentanglement between the latent dimensions, the presence of correlations in the classification dataset points to algorithmic dependence between at least some of these features. Finally, the classification task of separating dementia from non-dementia is considerably general and might not be able to leverage the richness of features identified in the RF-VAE latent space. In summary, this study showed that factorized VAEs could compress a CDT into a set of highly informative, statistically disentangled latent dimensions. These latent dimensions serve as generative features of the CDT and possess key information on characterizing dementia. We trained the RF-VAE in a completely unsupervised manner and agnostic to any cognitive outcome so that it can identify general, robust features that are informative to any downstream classification task. Thus, the same latent space can be fine-tuned to any downstream classification task related to clock drawings. Due to this advantage inherent in semi-supervised learning, in the future, we aim to represent different cognitive stressors (e.g., surgery, trauma) with a unique combination of the latent variables described here.
This will also enable us to better understand and predict the prognosis of cognitive ailments through the CDT. Furthermore, we plan to use the reported RF-VAE latent space to distinguish different types of dementia such as AD, VaD, mild cognitive impairment (MCI), amnestic-MCI, dysexecutive-MCI, and Parkinson's disease. Since our model relies only on the outcome of the CDT it can leverage large amounts of publicly available CDT data for enriching the performance of its disease-specific classifiers.

Conclusion
In conclusion, in this study we have identified a complete and mutually independent set of graphomotor anomalies which are meaningful sources of variation in the CDT. We have constructed neural network classifiers using these graphomotor features with and without the assistance of participant demographics. Our models were cross-validated for optimal performance and tested on an independent testing cohort to achieve superlative performance in distinguishing dementia from non-dementia clock drawings. In the future, we will expand this study to include post-surgical cognitive dysfunction, Parkinson's disease and specific types of dementia. We will also use independent publicly available datasets to further validate the features found in this study. This study is a pioneering work in generative feature learning using semi-supervised deep neural networks on clock drawing data.

Methods
Participants. Study materials were collected from digital clock drawing consortium data between the University of Florida (UF) and New Jersey Institute for Successful Aging (NJISA), Memory Assessment Program, School of Osteopathic Medicine, Rowan University. The Institutional Review Boards of the University of Florida and Rowan University approved the study. Study participants at both institutions gave their written approval to be included in the study through informed consent forms. All study procedures were carried out per the Declaration of Helsinki and respective university guidelines and TRIPOD criteria 31 . The study consisted of two data cohorts: Training dataset included a set of 23,521 clock drawings from 11,762 participants aged ≥ 65 years, primary English speaking, who completed clock drawing to command and copy conditions as part of routine medical care assessment in a preoperative setting 32 . Exclusion criteria were as follows: non-fluent in the English language; education < 4 years; visual, hearing, or motor extremity limitation that potentially inhibits the production of a valid clock drawing.
Classification dataset consists of a "fine-tuning" dataset and a "testing" dataset used to fine-tune and test dementia versus non-dementia neural network classifier, respectively. These datasets comprise clock drawings from individuals diagnosed with dementia and non-dementia peers. The dementia clocks were collected from 56 participants evaluated through a community memory assessment program within Rowan University. They were seen by a neuropsychologist, a psychiatrist, and a social worker. Inclusion criteria: age ≥ 55. Exclusion criteria: head trauma, heart disease, or other major medical illness that can induce encephalopathy; major psychiatric disorders; documented learning disability; seizure disorder or other major neurological disorder; less than 6th-grade education, and history of substance abuse. All individuals with dementia were assessed using the Mini-Mental State Examination (MMSE), serum studies and an MRI scan of the brain. These individuals have been described in previous studies 33 . As reported in previous studies, they were either diagnosed with AD or VaD using standard diagnostic criteria 34,35 .
A total of 175 non-dementia participants completed a research protocol consisting of neuropsychological measures and neuroimaging. Two neuropsychologists reviewed all data. Inclusion criteria: age ≥ 60, English primary language, availability of intact activities of daily living (ADLs) as per Lawton and Brody's Activity of Daily Living Scale, completed by both the participant and their caregiver 36 . Exclusion criteria: clinical evidence of major neurocognitive disorder at baseline, as per the Diagnostic and Statistical Manual of Mental Disorders-Fifth Edition 37 , presence of a significant chronic medical condition, major psychiatric disorder, history of head trauma/neurodegenerative disease, documented learning disorder, epilepsy or other significant neurological illness, less than 6th grade education, substance abuse in the past year, major cardiac disease, and chronic medical illness-induced encephalopathy. These participants were screened for dementia over the telephone using the Telephone Interview for Cognitive Status (TICS 38 ) and one in-person interview with a neuropsychologist and a research coordinator who also evaluated comorbidity rating 39 , anxiety, depression, ADLs, neuropsychological functioning, and digital clock drawing 40 . Data from these participants have been described in other studies 3,19 .
Procedure. Cohort participants completed two clock drawings: (a) the command condition, where they were instructed to "Draw the face of a clock, put in all the numbers, and set the hands to ten after eleven", and (b) the copy condition, wherein the participant was presented with a model of a clock and asked to copy the same underneath it 2 . A digital pen from Anoto, Inc. and associated smart paper 17 were used; the paper was folded in half, giving participants a drawing area of 8.5 × 5.5 inches. Only the final drawing was extracted and used for analyses in the current study.
Clock drawings to both command and copy conditions from the training cohort were used to train the RF-VAE. After that, clock drawings to both command and copy conditions from the fine-tuning cohort were used to train the weights of a neural network classifier and fine-tune the weights of the RF-VAE encoder to distinguish dementia from control clocks. Command and copy clocks were not separated in training because we wanted the model to learn clock encodings that are agnostic to any cognitive outcome and hence generalizable to multiple different classification tasks. The fine-tuning dataset comprised 84 dementia and 263 nondementia clocks. Ultimately, the classification network was tested on the test dataset comprising 28 dementia and 87 control clocks.
Individual clock drawings were extracted from the file using contour detection. The extracted contours were cropped to the boundaries of the clock drawing, padded with white space to a square, and resized to 64 × 64, as this was the only size supported by the RF-VAE implementation 25 used. Supplementary Fig. 3 shows the preprocessing pipeline described above.
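The padding and resizing steps can be sketched as below. This is an illustration under stated assumptions: the crop is centred on the square canvas, white is coded as 255, and nearest-neighbour sampling is used for resizing; the text does not specify any of these, and the contour-detection step is not shown. The helper names `pad_to_square` and `resize_nearest` are hypothetical.

```python
def pad_to_square(img, fill=255):
    """Centre a grayscale image (list of rows) on a square canvas filled with white."""
    h, w = len(img), len(img[0])
    side = max(h, w)
    top, left = (side - h) // 2, (side - w) // 2
    canvas = [[fill] * side for _ in range(side)]
    for y in range(h):
        canvas[top + y][left:left + w] = img[y]
    return canvas

def resize_nearest(img, out=64):
    """Resize a square image to out x out by nearest-neighbour sampling."""
    n = len(img)
    return [[img[y * n // out][x * n // out] for x in range(out)] for y in range(out)]

# Toy crop: 60 rows by 100 columns of ink (0), standing in for a cropped clock.
crop = [[0] * 100 for _ in range(60)]
square = pad_to_square(crop)   # 100 x 100, with white bands above and below
small = resize_nearest(square)  # 64 x 64 network input
print(len(small), len(small[0]), small[0][0], small[32][32])
```

Padding before resizing keeps the aspect ratio of the drawing intact, which matters here because several latent factors describe the shape of the clockface.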
Statistical testing. The latent features developed by the RF-VAE were tested for statistical difference between dementia and non-dementia cohorts using two-tailed Student's t-tests with multiple comparisons correction using the Benjamini-Hochberg method 41 with FDR = 0.01. The confounding effects of age and education were removed by propensity score matching with the open-source Python library PsmPy 42 . This gave us a propensity-score matched cohort of 110 dementia clocks and 220 non-dementia clocks. Significance levels shown in Fig. 3A were based on adjusted p-values estimated on this propensity-matched cohort, as shown in Supplementary Table 1. Correlation between the variables was calculated using Pearson's product-moment correlation coefficient. Thereafter, the correlation matrix was thresholded at 0.2 and −0.2, as these values represented the 95th and 5th percentiles of the non-parametric distribution of the correlation values. The thresholded binary matrix was used as an adjacency matrix to generate a cross-correlation graph between the latent variables.
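The Benjamini-Hochberg step-up procedure can be sketched in a few lines. This is a generic stdlib illustration with made-up p-values, not the authors' analysis code; the function name `benjamini_hochberg` is hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.01):
    """Return a list of booleans: True where the hypothesis is rejected at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * fdr.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * fdr:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected

# Made-up p-values for five features; not values from the study.
p_vals = [0.0001, 0.004, 0.03, 0.0005, 0.5]
print(benjamini_hochberg(p_vals))
```

With five tests and FDR = 0.01, the per-rank thresholds are 0.002, 0.004, 0.006, 0.008 and 0.01, so the three smallest p-values are rejected and the remaining two are not.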

Models and experimental setup. A variational autoencoder (VAE) is a generative model that learns a lower-dimensional representation of input data in the form of the mean and standard deviation of a Gaussian distribution, which it samples to reconstruct the input data. The non-linear output decoder network compensates for the loss of generality caused by the prior normal distribution. One disadvantage of the VAE latent distribution is a lack of disentanglement of factors, where disentanglement means that each latent variable is exclusively responsible for the variation of a unique aspect of the input data. In this paper, we have used an existing implementation of a VAE-based deep autoencoder model that can learn all meaningful sources of variations in clock drawings in its disentangled latent representation. This model, called RF-VAE, uses total correlation (TC) in the latent space to improve disentanglement of relevant sources of variation while tolerating significant KL divergences from nuisance prior distributions. Simultaneously, it identifies factors having low divergence from these nuisance priors as "nuisance sources of variation". This way, it can learn "all meaningful sources of variations" in its latent space.
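The Gaussian latent representation and its KL divergence from a standard normal prior can be illustrated with a short sketch. This covers only the standard VAE ingredients (reparameterized sampling and the closed-form per-dimension KL term); the total correlation penalty and the relevance weighting specific to RF-VAE are not shown, and the function names are hypothetical.

```python
import math
import random

def sample_latent(mu, sigma, rng):
    """Reparameterized draw z = mu + sigma * eps with eps ~ N(0, 1), per dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_standard_normal(mu, sigma):
    """Per-dimension KL( N(mu, sigma^2) || N(0, 1) ) in closed form."""
    return [0.5 * (m * m + s * s - 1.0 - math.log(s * s)) for m, s in zip(mu, sigma)]

# Two toy latent dimensions: one that matches the prior, one that departs from it.
mu, sigma = [0.0, 2.0], [1.0, 0.5]
z = sample_latent(mu, sigma, random.Random(0))
kl = kl_to_standard_normal(mu, sigma)
print([round(v, 4) for v in kl])
```

A dimension whose encoded distribution stays at the prior has a KL divergence of zero and carries no information about the input, which gives some intuition for why low-divergence dimensions can be treated as nuisance factors.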
The preprocessed clock image was fed to the RF-VAE network with a latent dimension of 10. The RF-VAE network was trained for 1400 epochs at a learning rate of 10⁻⁴ with a batch size of 64, following recommendations in source articles 25,43 . The reconstruction loss was cross-entropy, and the optimizer was Adam 44 . RF-VAE training took 3.5 h on a GeForce GP102 Titan X GPU from NVIDIA Corporation. The trained latent space of the RF-VAE was fed to a fully connected feed-forward neural network with two hidden layers, having seven neurons in the first hidden layer and four neurons in the second hidden layer. Using an Adam optimizer, the classifier was trained using the fine-tuning dataset for 20 epochs, with a batch size of 32 and a learning rate of 0.0075. The classification loss was binary cross-entropy. A 3.125:1 weight was assigned to the dementia class during training to ameliorate the class imbalance in the fine-tuning dataset. All hyper-parameters were selected using the fine-tuning dataset inside a fivefold cross-validation design by maximizing the average fold AUC of the model. Figure 6 shows the network architecture and represents our method's conceptual workflow. The top portion of each panel in the figure shows the training process of the RF-VAE. The bottom portion of the figure shows how the trained encoder weights of the RF-VAE support a task-specific classifier. The performance of this trained classifier was tested on the test data, and several important performance metrics, namely, AUC, Accuracy, Sensitivity, Specificity, Precision, and Negative Predictive Value (NPV), were reported. The test data were bootstrapped 100 times using random sampling with replacement to create confidence intervals. The median score, 2.5th percentile, and 97.5th percentile of these metrics over the bootstrapped test dataset were reported.
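The effect of a 3.125:1 class weight on binary cross-entropy can be sketched as follows. This is a minimal illustration, assuming the weight multiplies the loss term of the dementia (positive) class and that the loss is averaged over samples; the exact weighting scheme used in training is not described in the text, and `weighted_bce` is a hypothetical name.

```python
import math

def weighted_bce(y_true, p_pred, pos_weight=3.125, eps=1e-7):
    """Mean binary cross-entropy with the positive-class term scaled by pos_weight."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# An equally uncertain prediction costs 3.125 times more on a dementia clock.
loss_dementia = weighted_bce([1], [0.5])
loss_control = weighted_bce([0], [0.5])
print(round(loss_dementia, 4), round(loss_control, 4))

# Counts from the cohort description: 84 + 28 dementia and 263 + 87 control clocks.
print((263 + 87) / (84 + 28))
```

The last line shows that 3.125 equals the ratio of non-dementia to dementia clocks across the whole classification cohort (350/112); whether the weight was actually derived this way is an assumption.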
We evaluated the performance gain of the classifier upon the addition of age, sex, race, and years of education of participants to the model. The best-performing classifier consisted of three hidden layers with ten input neurons, 512 neurons in the first hidden layer, 256 neurons in the second hidden layer, and 128 neurons in the third hidden layer. It was trained for 20 epochs over the fine-tuning data with a batch size of 8, at a learning rate of 0.0075. All hyper-parameters were selected using the fine-tuning dataset inside a fivefold cross-validation design by maximizing the average fold AUC of the model. Figure 6 illustrates the different steps in the workflow. In the training stage, a ten-dimensional RF-VAE latent space is constructed by minimizing the loss between original and reconstructed clock drawings and minimizing the total correlation between latent dimensions to disentangle them. Furthermore, feature relevance is ensured in the latent space by eliminating those latent variables that do not diverge significantly from previously defined prior distributions. In the classification stage, the trained encoder is fine-tuned jointly with a fully connected neural network for classifying dementia from non-dementia clocks. Furthermore, age, sex, race, and years of education are added to the latent dimensions to train another classifier with higher performance.